WEEK 4: DATA JOINS AND TRANSFORMATIONS

Monday, January 30th

Today we will…

Data Layouts

Tidy Data

Image source: R4DS

Untidy “Messy” Data

Illustration by Allison Horst

Our tools!

Illustration by Allison Horst

Collaboration with tidy data.

Illustration by Allison Horst

Pivoting Data

Tidyexplain animation by Kelsey Gonzalez

Manual Method

We can do the wide-to-long transition manually.

Consider the table of daily rainfall observed in SLO in January 2023. This data is recorded in human-friendly form, in the approximate shape of a calendar. Each week has its own row, and each day has its own column.

Talk to a neighbor about how you would convert this to long format. You may want to open up the spreadsheet containing this table on your computer.

Data source

Manual Method: Steps

  1. Create a new column: Day_of_Week.

  2. Create a Rainfall column to hold the daily rainfall values.

  3. Now we have three columns set up (Week, Day_of_Week, and Rainfall) – start moving data over.

  4. Duplicate the repeated data (Weeks 1-5) and copy the Monday values over.

  5. Duplicate the repeated data (Weeks 1-5) and copy the Tuesday values over.

  6. Continue for the rest of the days of the week.

  7. You may want to arrange() the data by week so the rainfall values appear in chronological order.

Computational Approach

pivot_longer()

library(readxl)
library(tidyverse)
slo_rainfall <- read_xlsx("2023-rainfall-slo.xlsx")

slo_rainfall_long <- slo_rainfall |> 
  mutate(across(Sunday:Saturday, ~ na_if(.x, "NA")),
         across(Sunday:Saturday, as.numeric)
         ) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
slo_rainfall_long
# A tibble: 35 × 3
    Week Day_of_Week Daily_Rainfall
   <dbl> <chr>                <dbl>
 1     1 Sunday                0   
 2     1 Monday                0.12
 3     1 Tuesday               0   
 4     1 Wednesday             1.58
 5     1 Thursday              0.91
 6     1 Friday                0   
 7     1 Saturday              0.05
 8     2 Sunday                0.27
 9     2 Monday                4.26
10     2 Tuesday               0.43
# … with 25 more rows

Why tidy data?

library(liver)
data(cereal)
str(cereal, hide.attr = TRUE)
'data.frame':   77 obs. of  16 variables:
 $ name    : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ manuf   : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
 $ type    : Factor w/ 2 levels "cold","hot": 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
 $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
 $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
 $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
 $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
 $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
 $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
 $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
 $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
 $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
 $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
 $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
 $ rating  : num  68.4 34 59.4 93.7 34.4 ...
cereal_summary1 <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
cereal_summary1
# A tibble: 3 × 10
  shelf calories protein   fat sodium fiber carbo sugars potass vitamins
  <int>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>
1     1     102.    2.65  0.6    176. 1.68   15.8   4.8    75.5     20  
2     2     110.    1.90  1      146. 0.905  13.6   9.62   57.8     23.8
3     3     108.    2.86  1.25   159. 3.14   14.5   6.53  130.      35.4

cereal_summary2 <- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
cereal_summary2
# A tibble: 27 × 3
# Groups:   shelf [3]
   shelf Nutrient mean_amount
   <int> <chr>          <dbl>
 1     1 calories      102.  
 2     1 carbo          15.8 
 3     1 fat             0.6 
 4     1 fiber           1.68
 5     1 potass         75.5 
 6     1 protein         2.65
 7     1 sodium        176.  
 8     1 sugars          4.8 
 9     1 vitamins       20   
10     2 calories      110.  
# … with 17 more rows
cereal_summary2 |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

pivot_wider()

mean_protein <- cereal %>% 
  group_by(manuf, shelf) %>% 
  summarize(mean_protein = mean(protein))
mean_protein
# A tibble: 18 × 3
# Groups:   manuf [7]
   manuf shelf mean_protein
   <fct> <int>        <dbl>
 1 A         2         4   
 2 G         1         3   
 3 G         2         1.29
 4 G         3         2.67
 5 K         1         2.75
 6 K         2         2.14
 7 K         3         2.92
 8 N         1         2.67
 9 N         2         2.5 
10 N         3         4   
11 P         1         1.5 
12 P         2         1   
13 P         3         3   
14 Q         1         5   
15 Q         2         2   
16 Q         3         2.5 
17 R         1         2   
18 R         3         3   
protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf   `1`   `2`   `3`
  <fct> <dbl> <dbl> <dbl>
1 G      3     1.29  2.67
2 K      2.75  2.14  2.92
3 N      2.67  2.5   4   
4 P      1.5   1     3   
5 Q      5     2     2.5 
6 R      2    NA     3   
7 A     NA     4    NA   

Better names in pivot_wider()

protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf ")
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf `Shelf 1` `Shelf 2` `Shelf 3`
  <fct>     <dbl>     <dbl>     <dbl>
1 G          3         1.29      2.67
2 K          2.75      2.14      2.92
3 N          2.67      2.5       4   
4 P          1.5       1         3   
5 Q          5         2         2.5 
6 R          2        NA         3   
7 A         NA         4        NA   

Data Joins

Relational Data

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

IMDb movie relational data

Data Joins

Mutating joins

Add variables from a new data frame to observations in an existing data frame

full_join(), left_join(), right_join(), inner_join()

Filtering Joins

Filter observations based on values in a new data frame

semi_join(), anti_join()

Keys

Uniquely identifies an observation in a data set

Relate data sets to each other
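
The slides' directors table uses id as its key. A quick way to check that a candidate key really is unique is to count duplicates; this is a minimal sketch, assuming the directors table from the IMDb example is already loaded:

```r
# If id is a key for directors, no value of id should appear more than once
directors |> 
  count(id) |> 
  filter(n > 1)
# a true key returns zero rows here
```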

inner_join()

Matches pairs of observations when “keys” are equal

Inner Join: IMDb Example

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <int> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545
movies_directors
  director_id movie_id
1         429   300229
2        9247   124110
3       11652    10920
4       11652   333856
5       14927   192017
6       15092   109093
7       15092   237431
inner_join(directors_genres, movies_directors)
# A tibble: 12 × 4
# Groups:   director_id [4]
   director_id genre      prob movie_id
         <int> <chr>     <dbl>    <int>
 1         429 Adventure 0.75    300229
 2         429 Fantasy   0.75    300229
 3       11652 Sci-Fi    0.5      10920
 4       11652 Sci-Fi    0.5     333856
 5       11652 Action    0.5      10920
 6       11652 Action    0.5     333856
 7       14927 Animation 1       192017
 8       14927 Family    1       192017
 9       15092 Comedy    0.545   109093
10       15092 Comedy    0.545   237431
11       15092 Crime     0.545   109093
12       15092 Crime     0.545   237431

Inner Join: IMDb Example

What if our key variable is not named the same?

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <int> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545

Directors: 429, 2931, 9247, 11652, 14927, 15092

directors
     id first_name last_name
1   429     Andrew   Adamson
2  9247       Zach     Braff
3 11652  James (I)   Cameron
4 14927        Ron  Clements
5 15092      Ethan      Coen


inner_join(directors, 
           directors_genres, 
           by = c("id" = "director_id")
           )
     id first_name last_name     genre     prob
1   429     Andrew   Adamson Adventure 0.750000
2   429     Andrew   Adamson   Fantasy 0.750000
3 11652  James (I)   Cameron    Sci-Fi 0.500000
4 11652  James (I)   Cameron    Action 0.500000
5 14927        Ron  Clements Animation 1.000000
6 14927        Ron  Clements    Family 1.000000
7 15092      Ethan      Coen    Comedy 0.545455
8 15092      Ethan      Coen     Crime 0.545455


Mutating Joins

  • left_join() Everything is kept in the data set on the left

  • right_join() Everything is kept in the data set on the right

  • full_join() Everything is kept in both data sets

Mutating Joins

Discuss with a neighbor.

Which of the following directors would be kept in for each of:

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)
directors_genres |> 
  distinct(director_id)
# A tibble: 5 × 1
# Groups:   director_id [5]
  director_id
        <int>
1         429
2        2931
3       11652
4       14927
5       15092
movies_directors |> 
  distinct(director_id)
  director_id
1         429
2        9247
3       11652
4       14927
5       15092
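
One way to check your answers to the discussion question is to run each join and inspect which director_ids survive; a sketch, assuming both tables are loaded:

```r
# Mutating joins differ only in which rows they keep:
left_join(directors_genres, movies_directors)  |> distinct(director_id)  # ids from directors_genres
right_join(directors_genres, movies_directors) |> distinct(director_id)  # ids from movies_directors
full_join(directors_genres, movies_directors)  |> distinct(director_id)  # the union of both
```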

Filtering Joins: semi_join()

semi_join() Keeping observations

semi_join(directors_genres, movies_directors)
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <int> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

Movie Directors: 429, 9247, 11652, 14927, 15092

Including observations with %in%

directors_genres |>
  filter(director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <int> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

similar to semi_join()!

Filtering Joins: anti_join()

anti_join() Removing Observations

anti_join(directors_genres, movies_directors)
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <int> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

Movie Directors: 429, 9247, 11652, 14927, 15092

Excluding observations with ! and %in%

directors_genres |>
  filter(!director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <int> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

similar to anti_join()!

PA 4: Military Spending

Today you will be tidying untidy data to explore the relationship between countries of the world and military spending.

Due Wednesday, 2/1 at 8:00am

Bonus Challenge: Murder Mystery in SQL City

For this challenge, you will be using table joins to solve a murder mystery.

Due Sunday, 2/12 at 11:59pm

To do…

  • PA 4: Military Spending
    • Due Wednesday 2/1 at 8:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm

Wednesday, February 1st

Today we will…

  • Review PA 4: Military Spending
  • Practice with Relational Data
  • Lab 4: Avocado Prices
  • Challenge 4: Avocado Toast Ate My Mortgage

Lab + Challenge

Lab 4: Avocado Prices + Challenge 4: Avocado Toast Ate My Mortgage


Handy Helpers

rename() – Change the names of columns

separate() – Split the values of one column into multiple columns
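
A minimal sketch of both helpers on a made-up tibble (the data and column names here are hypothetical, not from the lab data):

```r
library(tidyverse)

prices <- tibble(avg_price   = c(1.25, 1.40),
                 region_year = c("SLO_2017", "SLO_2018"))

prices |> 
  rename(AveragePrice = avg_price) |>    # change a column name
  separate(region_year,                  # split one column into two
           into = c("region", "year"),
           sep  = "_")
```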


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

To do…

  • Lab 4: Avocado Prices
    • Due Friday, 2/3 at 11:59pm
  • Challenge 4: Avocado Toast Ate My Mortgage
    • Due Saturday, 2/4 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Concept Check 5.1 + 5.2 + 5.3 due Monday (2/6) at 8:00am